Abstract

Conformal predictors, introduced by Vovk et al. [16], serve to build prediction intervals 
by exploiting a notion of conformity of the new data point with previously observed data. 
In the present paper, we propose a novel method for constructing prediction intervals for 
the response variable in multivariate linear models. The main emphasis is on sparse linear
models, where only a few of the covariates have a significant influence on the response variable
even if their number is very large. Our approach is based on combining the principle of
conformal prediction with the $\ell_1$-penalized least squares estimator (LASSO). The resulting
confidence set depends on a parameter $\varepsilon > 0$ and has a coverage probability larger than
or equal to $1 - \varepsilon$. The numerical experiments reported in the paper show that the length
of the confidence set is small. Furthermore, as a by-product of the proposed approach, we
provide a data-driven procedure for choosing the LASSO penalty. The selection power of 
the method is illustrated on simulated data. 

Keywords: LASSO, LARS, Sparsity, Variable selection, Regularization path, Confidence 
set. 
1 Introduction

Consider observations $(x_i, y_i) \in \mathbb{R}^p \times \mathbb{R}$ for $i \ge 1$ from a linear regression model $y_i = x_i'\beta + \xi_i$,
where $\beta \in \mathbb{R}^p$ is the unknown parameter and the $\xi_i$'s are the noise variables. Suppose we have
already collected the dataset $\mathcal{E}_n = ((x_1, y_1), \ldots, (x_{n-1}, y_{n-1}), x_{new})$ where $x_{new} \in \mathbb{R}^p$ denotes
a new observation. Our goal is to predict the label $y_{new}$ corresponding to $x_{new}$ based on $\mathcal{E}_n$,
thereby exploiting the information in $x_{new}$. This setup is known as the transduction problem [12].
Our estimation strategy is based on local arguments in order to produce a better estimation
for $y_{new}$ [5]. More precisely, we follow the approach of conformal prediction presented
by Vovk et al. [16], which relies on two key ideas: one is to provide a confidence prediction
(namely, a confidence set containing $y_{new}$ with high probability) and the other is to account
for the similarity of the new data point $x_{new}$ compared to the previously observed $x_i$'s. The notion
of conformal predictor was first described by Vovk et al. [15]. Moreover, in [16], the authors
illustrate this approach on the example of ridge regression. Throughout the paper, this predictor
will be referred to as the Conformal Ridge Predictor (CoRP). In the present contribution, we
propose to adapt conformal predictors to the sparse linear regression model, that is, a model
where the regression vector $\beta \in \mathbb{R}^p$ contains only a few nonzero components. We introduce
a novel conformal predictor called the Conformal Lasso Predictor (CoLP), which takes into
account the sparsity of the model. Its construction is based on the LASSO estimator [10]. The
LASSO estimator for linear regression corresponds to an $\ell_1$-penalized least squares estimator
and it has been extensively studied over the last few years (0 CO HSj, among others) and 
several modifications have been proposed ([20l [THl Ell ttH E] among others). One attractive 
aspect of the LASSO is that it aims to provide accurate estimation while also performing
variable selection when the model is sparse. In the approach considered in the present paper,
the resulting Conformal Lasso Predictor has a large coverage probability and, at the same
time, a small length. When we deal with regularized methods like the Ridge
or the LASSO estimators, the choice of the penalty is an important task. In contrast to the
Conformal Ridge Predictor, for which no rule was established to pick the Ridge penalty [16],
the construction of the Conformal Lasso Predictor provides a data-driven way of choosing
the LASSO penalty. Moreover, it turns out that this choice is adapted to variable selection, as
supported by the numerical experiments.

The paper is organized as follows. We concisely introduce conformal prediction and the
LASSO procedure in Section [2] and Section [3] respectively. In Section [4] we give the explicit
form of the Conformal Lasso Predictor. An algorithm producing the CoLP is presented in
Section [5]. Then in Section [6] we discuss a generalization of the Conformal Lasso Predictor
to other selection-type procedures; we call these generalized procedures Sparse Conformal
Predictors. Finally, in Section [7] we illustrate the performance of Sparse Conformal Predictors
through some numerical experiments.

2 Conformal prediction 

Let us briefly describe the approach based on conformal prediction developed in the book
by Vovk et al. [16]. In order to predict
the label $y_{new}$ of a new observation $x_n = x_{new}$, the similarity of pairs of the form $(x_{new}, y)$,
where $y \in \mathbb{R}$, to the former observations $(x_i, y_i)$ for $i = 1, \ldots, n-1$ is exploited. This is
the purpose of introducing a nonconformity score $\alpha(y) = (\alpha_1(y), \ldots, \alpha_n(y))'$ which is based
on $\mathcal{E}_n$. Each component $\alpha_i$ describes the efficiency of explaining the observation $(x_i, y_i)$ by
a procedure based on the augmented sample $\{(x_1, y_1), \ldots, (x_{n-1}, y_{n-1}), (x_{new}, y)\}$. In order
to compare the different nonconformity scores $\alpha_i$, we use the
notion of p-value, as introduced in [16], defined as:

$$p(y) = \frac{1}{n} \left| \{ i \in \{1, \ldots, n\} : \alpha_i(y) \ge \alpha_n(y) \} \right|, \qquad (1)$$

where for any set $A$, we denote its cardinality by $|A|$. The above quantity lies between $1/n$
and 1. Moreover, we note that the smaller this p-value is, the less likely the tested pair
$(x_{new}, y)$ is (in other words, $y$ is an outlier when associated to $x_{new}$). An explicit form of the
nonconformity score and the p-value will be given in Section [4], where we adapt it to the
CoLP.
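As a rough illustration of definition (1), the following Python sketch computes the p-value from a list of nonconformity scores. The function name `conformal_p_value` and the toy scores are hypothetical; the last entry of the list plays the role of the score of the candidate pair $(x_{new}, y)$.

```python
def conformal_p_value(scores):
    """Fraction of scores that are at least as large as the last one.

    `scores` holds alpha_1(y), ..., alpha_n(y); the last entry is the
    score of the candidate pair (x_new, y), as in definition (1).
    """
    n = len(scores)
    alpha_n = scores[-1]
    count = sum(1 for a in scores if a >= alpha_n)
    return count / n


# Toy scores (illustrative values only).
typical = conformal_p_value([0.5, 1.2, 0.3, 0.9, 0.7])  # candidate fits in
outlier = conformal_p_value([0.5, 1.2, 0.3, 0.9, 5.0])  # candidate stands out
```

Since the candidate's own score is always counted, the returned value can never fall below $1/n$, in line with the remark following (1).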

Remark 1. The notion of p-value introduced in the present paper differs from the classical 
one. To make the connection with hypothesis testing in mathematical statistics JE/, consider 
the following hypotheses:

$H_0$ : the pair $(x_{new}, y)$ is conformal,
$H_1$ : the pair $(x_{new}, y)$ is not conformal.

Assume the observation $Y = y$ is given. The function $p(y)$ allows us to construct a statistical
test procedure with critical region $\mathcal{R}_\varepsilon = \{y : p(y) \le \varepsilon\}$, and $H_0$ is rejected if $y \in \mathcal{R}_\varepsilon$.

A nice feature of this nonconformity score is that it can be related to the confidence of
the prediction for $y_{new}$. We now recall the concept of conformal predictor introduced in [16].
Fix $\varepsilon \in (0, 1)$. Given the new observation $x_{new}$, we search for a subset $\Gamma^\varepsilon = \Gamma^\varepsilon(\mathcal{E}_n)$ of $\mathbb{R}$
in which $y_{new}$ is expected to lie with probability $1 - \varepsilon$. The conformal predictor
$\Gamma^\varepsilon$ is defined as the set of labels $y \in \mathbb{R}$ such that $p(y) > \varepsilon$. In other words, $\Gamma^\varepsilon$ consists of
labels $y$ which make the pair $(x_{new}, y)$ more conformal than a proportion $\varepsilon$ of the previous
pairs $(x_i, y_i)$ for $i = 1, \ldots, n-1$. Note moreover that the smaller $\varepsilon$, the more confident the
predictor. That is to say, for any $\varepsilon_1, \varepsilon_2 > 0$:

$$\Gamma^{\varepsilon_1} \subseteq \Gamma^{\varepsilon_2} \quad \text{whenever } \varepsilon_1 \ge \varepsilon_2.$$

In the present analysis, apart from prediction, we develop an approach for selecting relevant
variables. For this reason, we consider three criteria measuring the quality of our
procedure: validity, accuracy, and selection. The first two were introduced in [17]. The fact
that we consider the issue of sparsity leads us to include the selection power of the predictor.

Validity. This criterion accounts for the power of conformal prediction. The simplest approach
is to count the number of times where $y_n$ does not belong to the set $\Gamma^\varepsilon$. We use
the notation:

$$err_n^\varepsilon = \begin{cases} 1 & \text{if } y_n \notin \Gamma^\varepsilon, \\ 0 & \text{otherwise.} \end{cases}$$

Note that in an on-line perspective, one focuses on the cumulative error $ERR_n^\varepsilon = \sum_{i=1}^{n} err_i^\varepsilon$.
Asymptotic validity properties of this cumulative error have been studied
in [13] and [16, chapters 2 and 8]. In the present work, we will be interested in
evaluating the error $err_n^\varepsilon$ for a fixed $n$, rather than the cumulative one.

Accuracy. The length of the confidence predictor provides a natural measure of the accuracy.
We will see that such a measure is adapted to the variable selection purpose. Note
that other choices are possible. We shall discuss this point in Section [7].

Selection. Finally, in the case of sparse linear regression, it is important to include a measure
of the capacity of the estimator to select relevant variables, namely those for which the
regression parameter $\beta$ has nonzero components.

3 The LASSO Procedure

The LASSO estimator [10] was originally introduced in the linear regression model:

$$y_i = x_i'\beta^* + \xi_i, \qquad i = 1, \ldots, n-1, \qquad (2)$$
where the design $x_i = (x_{i,1}, \ldots, x_{i,p})' \in \mathbb{R}^p$ is deterministic, $\beta^* = (\beta_1^*, \ldots, \beta_p^*)' \in \mathbb{R}^p$ is the
unknown regression vector and the $\xi_i$'s are independent and identically distributed (i.i.d.)
centered Gaussian random variables with known variance $\sigma^2$. The goal is then to use the
observations to provide an approximation of the label $y_{new}$ of a new observation $x_{new}$ through
the estimation of the regression vector $\beta^*$. The LASSO estimator is defined as follows:

$$\hat\beta_\lambda = \operatorname{argmin}_{\beta \in \mathbb{R}^p} \sum_{i=1}^{n-1} (y_i - x_i'\beta)^2 + \lambda \sum_{j=1}^{p} |\beta_j|, \qquad (3)$$

where $\lambda \ge 0$ is a tuning parameter. Based on $\hat\beta_\lambda$, an estimate of the response $y_{new}$ of the
new observation $x_n = x_{new}$ is produced by $\hat\mu_\lambda = x_{new}'\hat\beta_\lambda$. For a large enough $\lambda$, the LASSO
estimator is sparse, that is, many components of $\hat\beta_\lambda$ equal zero. Therefore we can naturally
define a sparsity (or active) set as $\mathcal{A}_\lambda = \{j \in \{1, \ldots, p\} : (\hat\beta_\lambda)_j \ne 0\}$. A LASSO modification
of the LARS algorithm [3] can iteratively provide approximations of the LASSO estimator
for a few values of the tuning parameter $\lambda = \lambda_0, \ldots, \lambda_K$ such that $\infty = \lambda_0 > \cdots > \lambda_K = 0$
(the indices refer to the algorithm steps and $K$ denotes the last step). These points are the
so-called transition points.
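To illustrate the sparsity induced by criterion (3), here is a minimal pure-Python sketch that minimizes (3) by cyclic coordinate descent. This solver is swapped in for brevity and is not the LARS-based procedure used in the paper; the function names `soft_threshold` and `lasso_cd`, the number of sweeps and the toy data are all assumptions made for the example.

```python
def soft_threshold(z, t):
    """Shrink z towards zero by t (returns 0 when |z| <= t)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, y, lam, n_iter=200):
    """Minimize sum_i (y_i - x_i' beta)^2 + lam * sum_j |beta_j|.

    X is a list of rows; plain cyclic coordinate descent.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of column j with the residual that ignores beta_j.
            rho = 0.0
            norm2 = 0.0
            for i in range(n):
                pred_others = sum(X[i][l] * beta[l] for l in range(p) if l != j)
                rho += X[i][j] * (y[i] - pred_others)
                norm2 += X[i][j] ** 2
            # The factor 1/2 comes from the unnormalized squared loss in (3).
            beta[j] = soft_threshold(rho, lam / 2.0) / norm2
    return beta


# Toy data (illustrative values only).
X = [[1.0, 0.2], [2.0, -0.1], [3.0, 0.3], [4.0, -0.2]]
y = [2.1, 3.9, 6.2, 7.8]

beta_small = lasso_cd(X, y, lam=0.1)   # light penalty
beta_large = lasso_cd(X, y, lam=1e4)   # heavy penalty: every coefficient is zero
active_small = [j for j, b in enumerate(beta_small) if b != 0.0]
```

With the heavy penalty the active set is empty, while the light penalty keeps both toy variables; the transition points are the values of $\lambda$ at which such changes of the active set occur.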

From now on, let us write $\hat\beta_k$ and $\mathcal{A}_k$ for the LASSO estimator $\hat\beta_\lambda$ and the sparsity set $\mathcal{A}_\lambda$
evaluated at the transition point $\lambda = \lambda_k$. Obviously, the estimator $\hat\beta_k$ is an $|\mathcal{A}_k|$-dimensional
vector, where $|\mathcal{A}_k|$ is the cardinality of the set $\mathcal{A}_k$. Furthermore, we denote by $s_k$ the $|\mathcal{A}_k|$-dimensional
sign vector whose components are the signs of the components of the LASSO
estimator evaluated at the transition point $\lambda_k$ (i.e., $(s_k)_j = 1$ if $(\hat\beta_k)_j > 0$ and $(s_k)_j = -1$ if
$(\hat\beta_k)_j < 0$, where $j \in \mathcal{A}_k$). Finally, let us denote by $\mathbf{x}_k$ the $(n-1) \times |\mathcal{A}_k|$ matrix whose
columns are the variables $X_j$ with indices $j \in \mathcal{A}_k$. For each $\lambda_k$, we assume that the matrix
$\mathbf{x}_k'\mathbf{x}_k$ is invertible. Here are some characteristics of the LARS algorithm; we refer
to [2] for more details:

i) At each iteration of the algorithm (i.e., at each transition point), only one variable
$X_j = (x_{1,j}, \ldots, x_{n-1,j})'$, $j = 1, \ldots, p$, is added to (or deleted from) the construction of the
estimator according to its correlation with the current residual. The algorithm begins
with only one variable and ends up with the ordinary least squares (OLS) estimator².

² When $p > n$, the LARS cannot select all $p$ variables. It is limited by the sample size $n$. In such a case,
the last iteration does not correspond to the OLS.

ii) For each $\lambda \in (\lambda_{k+1}, \lambda_k]$, the LASSO estimator can be expressed in the following form:

$$\hat\beta_\lambda(y, \mathbf{x}_k, s_k) = (\mathbf{x}_k'\mathbf{x}_k)^{-1} \left( \mathbf{x}_k' y - \frac{\lambda}{2} s_k \right), \qquad (4)$$

where $y = (y_1, \ldots, y_{n-1})'$. Note that (4) is obtained by minimizing (3) over the set $\mathcal{A}_k$.
Let us also mention that the set $\mathcal{A}_k$ and the sign vector $s_k$ remain unchanged when $\lambda$
varies in the interval $(\lambda_{k+1}, \lambda_k]$.

iii) As highlighted by (4), the LASSO estimator is piecewise linear in $\lambda$ and linear in $y$ for
every fixed $\lambda$ [9]. Using the LASSO modification of the LARS algorithm, this property
helps us to provide the regularization path of the LASSO estimator, which is defined
as $\{\hat\beta_\lambda : \lambda \in [0, \infty)\}$ (each point of the regularization path corresponds to the evaluation
of the regression vector estimator for a given value of $\lambda$). Indeed, the slope of the
LASSO regularization path changes at a finite number of points which coincide with
the transition points $\lambda_1, \ldots, \lambda_K$.

iv) Piecewise linearity is an important property of the LASSO modification of the LARS
algorithm. Indeed, let $\lambda \in (\lambda_{k+1}, \lambda_k]$ where $\lambda_{k+1}$ and $\lambda_k$ are two transition points. In
this interval, the LASSO estimator $\hat\beta_\lambda$ uses the same variables (variables with indices
in $\mathcal{A}_k$). By using (4), it is easy to see [22] that the linearity of the LASSO estimator
implies that, for any $\lambda \in (\lambda_{k+1}, \lambda_k]$:

$$\sum_{i=1}^{n-1} (y_i - x_i'\hat\beta_\lambda)^2 \ge \sum_{i=1}^{n-1} (y_i - x_i'\hat\beta_{k+1})^2.$$

This last observation indicates that the transition points are the most interesting points 
in the regularization path. 
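For the reader's convenience, the closed form (4) can be recovered from the first-order condition of criterion (3) restricted to the active set. The following is only a sketch, assuming that the active set $\mathcal{A}_k$ and the signs $s_k$ are fixed on the interval considered and that $\mathbf{x}_k'\mathbf{x}_k$ is invertible:

```latex
% Criterion (3) restricted to the active coordinates, with fixed signs s_k:
%   Q(\beta) = \| y - \mathbf{x}_k \beta \|^2 + \lambda\, s_k' \beta .
\begin{align*}
\nabla Q(\beta) &= -2\,\mathbf{x}_k'(y - \mathbf{x}_k\beta) + \lambda s_k = 0 \\
\Longrightarrow\quad \mathbf{x}_k'\mathbf{x}_k\,\beta &= \mathbf{x}_k' y - \tfrac{\lambda}{2}\, s_k \\
\Longrightarrow\quad \hat\beta_\lambda &= (\mathbf{x}_k'\mathbf{x}_k)^{-1}\bigl(\mathbf{x}_k' y - \tfrac{\lambda}{2}\, s_k\bigr).
\end{align*}
```

On such an interval the solution is affine in $\lambda$ and linear in $y$, which is the piecewise linearity noted in iii).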

All these nice properties encourage the use of the LASSO as a selection procedure. In the 
sequel, we will consider the LASSO modification of the LARS algorithm which provides an 
approximate solution to the LASSO. 

Remark 2. Throughout the paper, one should keep in mind the correspondence between each iteration $k$
of the modification of the LARS algorithm and its corresponding tuning parameter value $\lambda_k$.
A decrease of the tuning parameter $\lambda$ corresponds to an increase in the number of iterations
of the modification of the LARS algorithm.

4 Sparse predictor with conformal Lasso 

For the reasons set out above, we focus on the transition points $\lambda_1, \ldots, \lambda_K$ and construct
conformal predictors for each of these $\lambda_k$. We then propose to select the best conformal
predictor among them according to its performance in terms of accuracy (cf. Section [2]).

Now let us detail the construction of the CoLP for each $\lambda_k$. To this end, denote by
$\tilde X_j = (x_{1,j}, \ldots, x_{n-1,j}, x_{new,j})'$, $j = 1, \ldots, p$, the augmented variable $j$. Define the augmented
matrix $\tilde{\mathbf{x}} = (x_1, \ldots, x_{n-1}, x_{new})' = (\tilde X_1, \ldots, \tilde X_p)$ and the augmented response vector
$\tilde y = (y_1, \ldots, y_{n-1}, y)'$, where $y$ is a candidate value for $y_{new}$. Using the notation introduced in
Section [3], for the fixed $\lambda_k$, we also define the LASSO estimator $\hat\beta_{\lambda_k}(\tilde y, \tilde{\mathbf{x}}_k, s_k)$ from expression
(4) with the augmented data. From now on, we denote this estimator by $\tilde\beta_k$. Define $\tilde\mu_k := \tilde{\mathbf{x}}_k\tilde\beta_k$.
Moreover, the matrix $\tilde H_k$ will be the $n \times n$ projection matrix onto the subspace
generated by the columns of $\tilde{\mathbf{x}}_k$, and $I$ the identity matrix of the same size. For each $\lambda_k$, we define a corresponding
nonconformity score $\alpha^k = (\alpha_1^k, \ldots, \alpha_n^k)'$ by:

$$\alpha^k(y) := |\tilde y - \tilde\mu_k| = \left| (I - \tilde H_k)\tilde y + \frac{\lambda_k}{2}\,\tilde{\mathbf{x}}_k(\tilde{\mathbf{x}}_k'\tilde{\mathbf{x}}_k)^{-1}s_k \right| = |A_k + B_k y|,$$

where $|\cdot|$ is meant here componentwise and

$$\begin{cases} A_k = (a_1^k, \ldots, a_n^k)' := (I - \tilde H_k)(y_1, \ldots, y_{n-1}, 0)' + \frac{\lambda_k}{2}\,\tilde{\mathbf{x}}_k(\tilde{\mathbf{x}}_k'\tilde{\mathbf{x}}_k)^{-1}s_k, \\ B_k = (b_1^k, \ldots, b_n^k)' := (I - \tilde H_k)(0, \ldots, 0, 1)'. \end{cases} \qquad (5)$$
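To make the decomposition of the score into a part that does not depend on $y$ and a part that is linear in $y$ more concrete, here is a small pure-Python sketch for the special case of a single active variable, where the projection matrix reduces to $x x'/(x'x)$. The function name `lasso_score_coeffs` and the toy numbers are hypothetical; the general case would use matrix algebra instead of scalars.

```python
def lasso_score_coeffs(x_aug, y_obs, lam, sign):
    """Return (A, B) with residual(y) = A + B * y, one active variable.

    x_aug : augmented column (x_1, ..., x_{n-1}, x_new)
    y_obs : observed responses (y_1, ..., y_{n-1})
    lam   : tuning parameter at the transition point
    sign  : sign (+1 or -1) of the active coefficient
    """
    n = len(x_aug)
    xtx = sum(v * v for v in x_aug)

    def residual_part(vec):
        # (I - H) vec with H = x x' / (x'x).
        coef = sum(xi * vi for xi, vi in zip(x_aug, vec)) / xtx
        return [vi - coef * xi for xi, vi in zip(x_aug, vec)]

    y0 = list(y_obs) + [0.0]          # (y_1, ..., y_{n-1}, 0)
    e_n = [0.0] * (n - 1) + [1.0]     # (0, ..., 0, 1)
    shift = [0.5 * lam * xi * sign / xtx for xi in x_aug]
    A = [r + s for r, s in zip(residual_part(y0), shift)]
    B = residual_part(e_n)
    return A, B


# Toy data (illustrative values only).
x_aug = [1.0, 2.0, 3.0, 1.5]
y_obs = [1.1, 2.3, 2.9]
A, B = lasso_score_coeffs(x_aug, y_obs, lam=0.4, sign=1.0)

# Nonconformity scores for a candidate label y = 1.6.
y_cand = 1.6
scores = [abs(a + b * y_cand) for a, b in zip(A, B)]
```

The point of the decomposition is that `A` and `B` are computed once; the scores for any other candidate label then cost only one multiplication and one addition per observation.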
Note that each component $\alpha_i^k(y)$ is piecewise linear with respect to $y$. Then the corresponding
p-value $p_k(y)$ as defined by (1) can clearly change only at points $y$ where the sign of
$\alpha_i^k(y) - \alpha_n^k(y)$ changes. Hence, we do not have to evaluate all the possible values of $y$. We only focus
on points $y$ for which the $i$-th nonconformity measure $\alpha_i^k(y)$ equals $\alpha_n^k(y)$. For this purpose,
we define, for each observation $i \in \{1, \ldots, n\}$,

$$S_i^k = \{y : \alpha_i^k(y) \ge \alpha_n^k(y)\}, \qquad (6)$$

which corresponds to the range of values $y$ such that the new pair $(x_{new}, y)$ has a better
conformity score than the $i$-th pair $(x_i, y_i)$. Moreover, let $l_i^k$ and $u_i^k$ denote two real numbers defined
respectively as

$$l_i^k = \min\left\{ -\frac{a_i^k - a_n^k}{b_i^k - b_n^k},\ -\frac{a_i^k + a_n^k}{b_i^k + b_n^k} \right\} \quad \text{and} \quad u_i^k = \max\left\{ -\frac{a_i^k - a_n^k}{b_i^k - b_n^k},\ -\frac{a_i^k + a_n^k}{b_i^k + b_n^k} \right\}, \qquad (7)$$

where $a_i^k$ and $b_i^k$ are given by (5).

Proposition 1. Let us fix a $k \in \{1, \ldots, K\}$ and an $i \in \{1, \ldots, n-1\}$. Assume that both $b_i^k$
and $b_n^k$ are non-negative. Then

i) if $b_i^k \ne b_n^k$, we have either $S_i^k = [l_i^k; u_i^k]$ or $S_i^k = (-\infty; l_i^k] \cup [u_i^k; +\infty)$, with $l_i^k$ and $u_i^k$
given by (7).

ii) if $b_i^k = b_n^k \ne 0$, then $l_i^k = u_i^k = -\frac{a_i^k + a_n^k}{2 b_i^k}$ and we have either $S_i^k = (-\infty; l_i^k]$ or
$S_i^k = [u_i^k; +\infty)$. Moreover if $a_i^k = a_n^k$, we have $S_i^k = \mathbb{R}$.

iii) if $b_i^k = b_n^k = 0$, we have either $S_i^k = \mathbb{R}$ or $S_i^k = \emptyset$.
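The case distinction of Proposition 1 can be sketched in Python as follows. The function name `score_dominance_set` is hypothetical, and the sketch works on the sign of the quadratic $q(y) = (a_i + b_i y)^2 - (a_n + b_n y)^2$ rather than on harmonized coefficients, so it does not require the $b$'s to be non-negative; exact comparisons with zero are used for simplicity and would need a tolerance in numerical practice.

```python
import math


def score_dominance_set(a_i, b_i, a_n, b_n):
    """Closed intervals forming {y : |a_i + b_i*y| >= |a_n + b_n*y|}.

    Returns a list of (low, high) pairs; infinite ends use math.inf.
    Works on the sign of q(y) = (a_i + b_i*y)**2 - (a_n + b_n*y)**2.
    """
    c2 = b_i * b_i - b_n * b_n
    c1 = 2.0 * (a_i * b_i - a_n * b_n)
    c0 = a_i * a_i - a_n * a_n
    if c2 != 0.0:
        # The two crossing points, where the scores are equal.
        r1 = -(a_i - a_n) / (b_i - b_n)
        r2 = -(a_i + a_n) / (b_i + b_n)
        low, high = min(r1, r2), max(r1, r2)
        if c2 < 0.0:
            return [(low, high)]
        return [(-math.inf, low), (high, math.inf)]
    if c1 != 0.0:
        # q is linear in y: a single half-line.
        root = -c0 / c1
        return [(root, math.inf)] if c1 > 0.0 else [(-math.inf, root)]
    # q is constant in y: the set is the whole line or empty.
    return [(-math.inf, math.inf)] if c0 >= 0.0 else []


bounded = score_dominance_set(1.0, 0.5, 0.0, 1.0)     # case i), an interval
half_line = score_dominance_set(0.0, 1.0, 1.0, 1.0)   # case ii), a half-line
whole_line = score_dominance_set(2.0, 0.0, 1.0, 0.0)  # case iii), all of R
empty = score_dominance_set(1.0, 0.0, 2.0, 0.0)       # case iii), empty set
```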

The assumption that all the $b_i^k$ are non-negative does not lose any generality, as
one can multiply $a_i^k$ and $b_i^k$ by $-1$ if $b_i^k < 0$. With this definition of $S_i^k$, we may rewrite
the definition of the conformal predictor as follows

$$\Gamma_k^\varepsilon = \Bigl\{ y : \sum_{i=1}^{n} I\bigl(\alpha_i^k(y) \ge \alpha_n^k(y)\bigr) > n\varepsilon \Bigr\} = \Bigl\{ y : \sum_{i=1}^{n} I\bigl(y \in S_i^k\bigr) > n\varepsilon \Bigr\}, \qquad (8)$$

where $I(\cdot)$ stands for the indicator function. This approach leads to a whole collection of
confidence intervals $\Gamma_1^\varepsilon, \ldots, \Gamma_K^\varepsilon$. We propose below a strategy for choosing one particular
$\Gamma_k^\varepsilon$, the performance of which will be studied through numerical simulations.

It is worth mentioning that in view of [14, Theorem 1] (see also [16, Proposition 2.3
page 26]), each predictor $\Gamma_k^\varepsilon$ would have a coverage probability at least equal to $1 - \varepsilon$ if
the corresponding value of the tuning parameter were deterministic. In fact, the following
result holds.

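Proposition 2. Fix the significance level $\varepsilon \in (0, 1)$ and the tuning parameter $\lambda > 0$. Let
$\tilde\beta_{\lambda,n}(y)$ be the Lasso estimate for the augmented dataset $(\tilde y, \tilde{\mathbf{x}})$ and let us define
$\alpha^\lambda(y) = |\tilde y - \tilde{\mathbf{x}}\tilde\beta_{\lambda,n}(y)|$. Then, the conformal predictor

$$\Gamma_\lambda^\varepsilon = \Bigl\{ y : \sum_{i=1}^{n} I\bigl(\alpha_i^\lambda(y) \ge \alpha_n^\lambda(y)\bigr) > n\varepsilon \Bigr\}$$

satisfies

$$P(y_{new} \in \Gamma_\lambda^\varepsilon) \ge 1 - \varepsilon,$$

for any $n \in \mathbb{N}$.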

Actually, in the proof of Proposition [2] detailed in [14], one needs the exchangeability of the
pairs $(x_1, y_1), \ldots, (x_{n-1}, y_{n-1}), (x_n, y)$ in the definition of the predictor. This property is not
fulfilled when the tuning parameter $\lambda$ is chosen in the set $\{\lambda_1, \ldots, \lambda_K\}$ of Lasso's transition
points, since the elements of this set depend only on the first $n - 1$ observations and not on
$(x_n, y)$. We believe that under some additional assumptions a result similar to Proposition [2]
can be obtained for the predictor $\Gamma_k^\varepsilon$ as well, for each $k = 1, \ldots, K$. This is the topic of
ongoing work. In the present paper, we content ourselves with proposing a data-driven choice
of the conformal predictor from the collection of predictors $\{\Gamma_k^\varepsilon;\ 1 \le k \le K\}$ and with exploring
its empirical properties.

Remark 3. Of course, one can also apply the well-known sample splitting technique for
choosing the values $\lambda_1, \ldots, \lambda_K$ based on a first sample, and then use the methodology described
below for selecting the data-driven predictor based on a second sample which is assumed to be
independent of the first sample. However, this technique is not attractive from the practical
standpoint, which is why we do not develop this approach.

As discussed above, we believe that all the predictors $\Gamma_k^\varepsilon$ nearly share the $1 - \varepsilon$ validity
property, which is supported by our empirical study. We suggest selecting among them the
one which has the smallest Lebesgue measure. We denote this confidence set by $\Gamma_{opt}^\varepsilon$, that is

$$\Gamma_{opt}^\varepsilon = \Gamma_{\hat k}^\varepsilon, \qquad \hat k = \operatorname{argmin}_k |\Gamma_k^\varepsilon|. \qquad (9)$$

In general, since $\hat k$ is a random variable, the $1 - \varepsilon$ validity of all $\Gamma_k^\varepsilon$ would not imply the $1 - \varepsilon$
validity of $\Gamma_{opt}^\varepsilon$, but only $1 - K\varepsilon$ validity. However, $1 - K\varepsilon$ is a worst-case bound obtained
by a simple application of the union bound, whereas the numerical examples we considered (some
of them are reported below) suggest that the validity is much better than $1 - K\varepsilon$ and could
even be equal to $1 - \varepsilon$ when $p < n$.
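The selection rule (9) can be sketched as follows, assuming each predictor is stored as a list of disjoint intervals. The names `lebesgue_measure` and `select_smallest` and the three candidate sets are hypothetical and serve only as an illustration.

```python
import math


def lebesgue_measure(intervals):
    """Total length of a list of disjoint (low, high) intervals."""
    return sum(high - low for low, high in intervals)


def select_smallest(predictors):
    """Index and value of the predictor with the smallest total length.

    `predictors` is a list of interval lists, one per transition point.
    """
    best = min(range(len(predictors)),
               key=lambda k: lebesgue_measure(predictors[k]))
    return best, predictors[best]


# Hypothetical predictors for K = 3 transition points.
candidates = [
    [(-math.inf, math.inf)],      # uninformative predictor
    [(1.0, 4.5)],                 # a single interval
    [(1.2, 2.0), (2.5, 4.0)],     # two disjoint pieces
]
k_hat, gamma_opt = select_smallest(candidates)
```

Because the selected index depends on the data, the guarantee that can be stated without further assumptions for this toy case with $K = 3$ and $\varepsilon = 0.1$ is only the union-bound level $1 - K\varepsilon = 0.7$.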

5 Implementation 

We provide here a three-step algorithm which enables us to easily construct the CoLP. We
start in Step 1 by applying the LASSO modification of the LARS algorithm to the dataset
$((x_1, y_1), \ldots, (x_{n-1}, y_{n-1}))$. This step provides all transition points $\lambda_1, \ldots, \lambda_K$, the corresponding
design matrices $\mathbf{x}_k$ and sign vectors $s_k$ for $k = 1, \ldots, K$. Then, in Step 2, we
construct the conformal predictor $\Gamma_k^\varepsilon$ associated to each $\lambda_k$. Thanks to Proposition [1], for
each $\lambda_k$ we can construct the sets $S_i^k$ for $i = 1, \ldots, n$ defined by (6). We use these sets in
order to construct the conformal predictor $\Gamma_k^\varepsilon$. To do this, we take advantage of the fact
that the function $y \mapsto \sum_{i=1}^{n} I(y \in S_i^k)$ is piecewise constant. Furthermore, the endpoints of
the intervals where this function is constant belong to the set of all endpoints of the intervals
forming the sets $S_i^k$. Thus, to determine $\Gamma_k^\varepsilon$, we sort the set $U_k$ consisting of all endpoints
of the intervals described in Proposition 1 and include an interval having as endpoints two
successive elements of $U_k$ in $\Gamma_k^\varepsilon$ if the center of this interval belongs to at least $[n\varepsilon]$ sets $S_i^k$.



Algorithm 1 : Lasso Conformal Predictor
Step 1: Run the LASSO modification of the LARS algorithm on the data set
        $((x_1, y_1), \ldots, (x_{n-1}, y_{n-1}))$

Step 2: Construct the Conformal Lasso Predictors for each $\lambda_k \in \{\lambda_1, \ldots, \lambda_K\}$
begin
    Step 2a: Initialization: Define $A_k$ and $B_k$ as in (5). Set $U_k \leftarrow \emptyset$
    Step 2b: Harmonization
    for i = 1 to n do
        if $b_i^k < 0$ then
            $a_i^k = -a_i^k$ and $b_i^k = -b_i^k$
        end
    end

    Step 2c: Update the set $U_k$
    for i = 1 to n do
        if $b_i^k \ne b_n^k$ then
            Add $l_i^k$ and $u_i^k$ (7) to $U_k$
        end
        if $b_i^k = b_n^k \ne 0$ and $a_i^k \ne a_n^k$ then
            Add $l_i^k = u_i^k$ (7) to $U_k$
        end
    end

    Step 2d: Sort $U_k$. Let $m \leftarrow |U_k|$. Then $y_{(0)} \leftarrow -\infty$ and $y_{(m+1)} \leftarrow +\infty$

    Step 2e: Evaluate $N_j^k$ for $j = 0, \ldots, m$. Initialize $N_j^k \leftarrow 0$. Then update
    for i = 1 to n do
        for j = 0 to m do
            if $|a_i^k + b_i^k y| \ge |a_n^k + b_n^k y|$ for $y \in (y_{(j)}, y_{(j+1)})$ then
                Increment $N_j^k = N_j^k + 1$
            end
        end
    end

    Step 2f: For a fixed threshold $\varepsilon > 0$, output the conformal predictor $\Gamma_k^\varepsilon$, the union of
             the intervals with endpoints $y_{(j)}$ and $y_{(j+1)}$ such that $N_j^k > n\varepsilon$
end

Step 3: Output the Conformal Lasso Predictor $\Gamma_{opt}^\varepsilon$ as the smallest (w.r.t.
        Lebesgue measure) confidence set among the constructed conformal predictors
Finally, in Step 3, we provide the CoLP, denoted $\Gamma_{opt}^\varepsilon$, which is defined as the smallest confidence
set, according to its Lebesgue measure, among the constructed conformal predictors
$\Gamma_k^\varepsilon$, $k = 1, \ldots, K$. According to Proposition [2], each $\Gamma_k^\varepsilon$ is valid. Moreover, the criterion for
choosing the CoLP is adapted to variable selection, as the conformal predictors constructed here
for different values of $\lambda_k$, $k = 1, \ldots, K$, bring into play different variables. This is illustrated
in Figure 1 (left), where we constructed the conformal predictors when $n = 300$. One can
observe that all the conformal predictors are valid since they contain the true value of the
label $y_{new}$. Hence our construction is suitable when the sample size is larger than the number
of variables (i.e., $n > p$) but may not be appropriate when $p > n$. Figure 1 (right) shows
an example where almost all the constructed conformal predictors $\Gamma_k^\varepsilon$, $k = 1, \ldots, K$, using
the above algorithm are valid. Only six are not. One of them is the selected CoLP (iteration
57 in Figure 1 (right)), which corresponds to the smallest predictor. In such cases ($p > n$), a
correction can be made and other choices for the accuracy measure are possible. We discuss
this criterion in Section [7]. Let us add that we only illustrated the validity of the conformal
predictors in Figure 1 (right), as the unstable zone (on the right side of the vertical line) makes
the representation hard to analyze. More details are given in Section [7].

[Figure 1. Left panel: "Conformal predictors when n=300" (x-axis: iteration; legend: $y_{new}$, CoLP). Right panel: "Validity when n=50" (x-axis: iteration; legend: Predictor, Chosen step).]

Figure 1: Left: Evolution of the conformal predictors $\Gamma_k^\varepsilon$ through the iterations of the LASSO modification
of the LARS algorithm when $n = 300$ (the first iteration corresponds to $\lambda_{max}$ and the last
one corresponds to $\lambda_{min}$). The CoLP is drawn in cyan and corresponds to the 34-th iteration. The
horizontal blue line corresponds to the value of $y_{new}$. Right: Validity analysis ($err_n^\varepsilon$) of the conformal
predictors $\Gamma_k^\varepsilon$ through the iterations of the LASSO modification of the LARS algorithm when
$n = 50$ (the first iteration corresponds to $\lambda_{max}$ and the last one corresponds to $\lambda_{min}$). The CoLP
is marked by a black square and corresponds to the 57-th iteration. The vertical line represents a
separation between a stable and an unstable zone.

Remark 4. In Step 1 of Algorithm 1, we use the LARS algorithm for its ability to generate
a small number of tuning parameter values of interest. It is an important aspect as it consid- 
erably reduces the computational cost. On-line versions could be implemented by plugging in 
an on-line version of the LASSO solution as in 0]/. The analysis of such on-line versions is 
the subject of work in progress.



6 Extension to other procedures

In this section we generalize the construction of the confidence predictor to a family of estimators
which includes selection-type procedures such as the Elastic-Net [21] and the Smooth-Lasso [6].
As for the CoLP (Section [4]), we are interested in two properties of estimators: the piecewise
linearity w.r.t. the response $y$ (to easily compute the nonconformity scores $\alpha_i$, $i = 1, \ldots, n$),
and the piecewise linearity w.r.t. the tuning parameter $\lambda$ [9] (to reduce computational effort
by using a modification of the LARS algorithm).

We use the same notation as in Section [3] for the LASSO estimator. Let $\hat\beta$ be an estimator
of the regression vector $\beta^*$ based on $\mathbf{x}$ and $y$. Let also $s$ be the sign vector of the estimator
$\hat\beta$. On the other hand, using the notation of Section [4], we set $\tilde\mu = \tilde{\mathbf{x}}\tilde\beta$, where this time $\tilde\beta$ is
based on the augmented dataset $\tilde{\mathbf{x}}$ and $\tilde y$.






Assumption 1. The estimator $\tilde\mu$ can be written as:

$$\tilde\mu = u(\tilde{\mathbf{x}}, s)\,\tilde y + v(\tilde{\mathbf{x}}, s), \qquad (10)$$

where $u(\cdot)$ and $v(\cdot)$ are piecewise constant functions w.r.t. $y$.

As soon as Assumption [1] holds, we can construct a conformal predictor corresponding to
the estimator $\tilde\mu$. Many estimators can then be considered. The CoLP and CoRP obviously
belong to this class of predictors, and we introduce here the Conformal Elastic Net Predictor
(CENeP), which is a conformal predictor constructed based on the Elastic-Net modification
of the LARS instead of the LASSO one (Step 1 in Algorithm [1]). This predictor is defined by
$u(\tilde{\mathbf{x}}, s) = \tilde{\mathbf{x}}_k(\tilde{\mathbf{x}}_k'\tilde{\mathbf{x}}_k + \mu_k I_k)^{-1}\tilde{\mathbf{x}}_k'$ and $v(\tilde{\mathbf{x}}, s) = -\lambda_k\tilde{\mathbf{x}}_k(\tilde{\mathbf{x}}_k'\tilde{\mathbf{x}}_k)^{-1}s_k$, where $\lambda_k$ and $\mu_k$ correspond
respectively to the LASSO and Ridge tuning parameters in the definition of the Elastic-Net
estimator and $I_k$ is the $|\mathcal{A}_k| \times |\mathcal{A}_k|$ identity matrix [21]. In the same way, we can define the
Conformal Smooth Lasso Predictor (CoSmoLaP) based on a Smooth-Lasso modification of the
LARS algorithm [6]. Here $u(\tilde{\mathbf{x}}, s) = \tilde{\mathbf{x}}_k(\tilde{\mathbf{x}}_k'\tilde{\mathbf{x}}_k + \mu_k J_k)^{-1}\tilde{\mathbf{x}}_k'$ and $v(\tilde{\mathbf{x}}, s) = -\lambda_k\tilde{\mathbf{x}}_k(\tilde{\mathbf{x}}_k'\tilde{\mathbf{x}}_k)^{-1}s_k$.
The difference between the CoSmoLaP definition and the CENeP one is the identity matrix $I_k$,
which is replaced by the $|\mathcal{A}_k| \times |\mathcal{A}_k|$ matrix $J_k$ whose components are such that $(J_k)_{i,i} = 1$
if $i = 1$ or $i = |\mathcal{A}_k|$ and $(J_k)_{i,i} = 2$ otherwise. Moreover, for $(i, j) \in \{1, \ldots, |\mathcal{A}_k|\}^2$ with $i \ne j$,
we have $(J_k)_{i,j} = -1$ if $|i - j| = 1$ and zero otherwise. Note that the definition of $J_k$ makes
the CoSmoLaP more appropriate for models with correlation between successive
variables.
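The matrix $J_k$ described above can be built as in the following sketch. The function names `smooth_lasso_matrix` and `quad_form` are hypothetical. The identity checked in the usage lines, namely that the quadratic form of this tridiagonal matrix equals the sum of squared differences between successive coefficients (for size at least 2), is a standard property offered here as an interpretation rather than as a statement taken from the text.

```python
def smooth_lasso_matrix(size):
    """Build the size x size tridiagonal matrix J_k described in the text."""
    J = [[0.0] * size for _ in range(size)]
    for i in range(size):
        # Diagonal: 1 at both ends, 2 elsewhere.
        J[i][i] = 1.0 if i in (0, size - 1) else 2.0
        # Entries next to the diagonal.
        if i + 1 < size:
            J[i][i + 1] = -1.0
            J[i + 1][i] = -1.0
    return J


def quad_form(J, beta):
    """Compute beta' J beta."""
    m = len(beta)
    return sum(beta[i] * J[i][j] * beta[j] for i in range(m) for j in range(m))


J4 = smooth_lasso_matrix(4)
beta = [1.0, 3.0, 2.0, 2.0]
penalty = quad_form(J4, beta)
squared_jumps = sum((beta[j + 1] - beta[j]) ** 2 for j in range(len(beta) - 1))
```

Under this reading, a penalty built on $J_k$ discourages large jumps between neighbouring coefficients, which matches the remark about successive variables.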

As for the CoLP, we can define the nonconformity score of a candidate label $y$ associated to
the estimator $\tilde\mu$ as follows:

$$(\alpha_1(y), \ldots, \alpha_n(y))' := |\tilde y - \tilde\mu| = |(I - u(\tilde{\mathbf{x}}, s))\,\tilde y - v(\tilde{\mathbf{x}}, s)| = |A + By|,$$

with

$$\begin{cases} A = (a_1, \ldots, a_n)' := (I - u(\tilde{\mathbf{x}}, s))(y_1, \ldots, y_{n-1}, 0)' - v(\tilde{\mathbf{x}}, s), \\ B = (b_1, \ldots, b_n)' := (I - u(\tilde{\mathbf{x}}, s))(0, \ldots, 0, 1)', \end{cases}$$

and $I$ is the $n \times n$ identity matrix. The quantities $A$ and $B$ are the analogues of $A_k$ and $B_k$
respectively, when we considered the CoLP at the transition point $\lambda_k$, $k = 1, \ldots, K$. Then,
replacing $A_k$ and $B_k$ by respectively $A$ and $B$ in Step 2a of Algorithm [1], we obtain the
conformal predictors associated to the estimator $\tilde\mu$.
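As a consistency check (a sketch only, using expression (4) of the LASSO estimator applied to the augmented data), the CoLP fits Assumption 1 with the following choice of $u$ and $v$, and the general $A$ and $B$ then reduce to the quantities $A_k$ and $B_k$ of Section [4]:

```latex
\begin{align*}
\tilde\mu_k = \tilde{\mathbf{x}}_k \tilde\beta_k
  &= \tilde{\mathbf{x}}_k (\tilde{\mathbf{x}}_k'\tilde{\mathbf{x}}_k)^{-1}
     \bigl( \tilde{\mathbf{x}}_k' \tilde y - \tfrac{\lambda_k}{2}\, s_k \bigr)
   = \underbrace{\tilde H_k}_{u(\tilde{\mathbf{x}}, s)} \tilde y
     \;\underbrace{-\, \tfrac{\lambda_k}{2}\,
       \tilde{\mathbf{x}}_k (\tilde{\mathbf{x}}_k'\tilde{\mathbf{x}}_k)^{-1} s_k}_{v(\tilde{\mathbf{x}}, s)}, \\
A &= (I - \tilde H_k)(y_1, \ldots, y_{n-1}, 0)'
     + \tfrac{\lambda_k}{2}\,
       \tilde{\mathbf{x}}_k (\tilde{\mathbf{x}}_k'\tilde{\mathbf{x}}_k)^{-1} s_k = A_k, \\
B &= (I - \tilde H_k)(0, \ldots, 0, 1)' = B_k .
\end{align*}
```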

Note that the dependency on the tuning parameter $\lambda$ can be included in $u(\tilde{\mathbf{x}}, s)$ (as
for the CoRP), in $v(\tilde{\mathbf{x}}, s)$, or in both of them (as for the CoLP). For instance, in the construction
of the CoLP, this dependency appears in the matrix $\tilde{\mathbf{x}}_k$ and the sign vector $s_k$, as they
were computed by the LARS algorithm for a specified value $\lambda_k$ of the tuning parameter $\lambda$.

The computational cost of the construction of conformal predictors also has to be considered.
Three main points are involved. First, one run of the LARS algorithm has the same cost
as the computation of the least squares estimator. Then we have to consider the number of
conformal predictors we have to construct: each value of the tuning parameter $\lambda$ provides
a conformal predictor $\Gamma_\lambda^\varepsilon$ using the algorithm described in Section [5]. The final conformal
predictor $\Gamma_{opt}^\varepsilon$ is then the one with the minimal length. As for the CoRP, the main problem
is: how many $\lambda$'s do we have to test? One way is to use a grid of values for $\lambda$, which leaves open
the problem of the choice of the grid and of its range.
On the other hand, we saw how the LARS algorithm allows us to reduce considerably the
number of tuning parameters to be considered. Indeed, the grid of tuning parameter values
is directly described by the transition points $\lambda_1, \ldots, \lambda_K$ obtained from the run of the LARS
algorithm. Finally, let us consider the construction of the conformal predictor itself: this
point has been treated in Vovk et al. [16, Chapters 2.3 and 4.1]. It turns out that constructing sparse
conformal predictors and the CoLP requires computation time $O(n^2)$, which can be reduced to
$O(n \log(n))$.

7 Experimental Results 

In this section we present the experimental performance of the Sparse Conformal Predictors
(SCP) w.r.t. their validity, their accuracy and also their selection power. As benchmarks,
we use the CoRP³ for its validity and accuracy and the original LASSO and Elastic-Net
estimators for their selection⁴ power.

We consider three SCPs: the Conformal Lasso Predictor (CoLP, introduced in Sections [4]
and [5]), the Conformal Elastic Net Predictor (CENeP, described in Section [6]) and the
Conformal Ridge Lasso Predictor (CoRLaP), which is a mix of the CoRP and the
CoLP. To construct the CoRLaP, we use the variables selected by the LASSO modification
of the LARS algorithm (Step 1 in Algorithm [1] described in Section [5]). Then we use these
variables to construct a CoRP. This conformal predictor can be seen as a restricted CoRP.
All conformal predictors are constructed with confidence level $1 - \varepsilon = 90\%$.
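For orientation, the empirical validity of a predictor over repeated simulations can be summarized as in the sketch below. The normal-approximation half-width with $z = 1.96$ is an assumption about how such a precision could be computed, not something stated in the text, and the name `validity_frequency` and the toy outcomes are hypothetical.

```python
import math


def validity_frequency(covered, z=1.96):
    """Empirical coverage and normal-approximation half-width.

    `covered` is a list of booleans, one per replication, telling
    whether the predictor contained the true label.
    """
    m = len(covered)
    freq = sum(1 for c in covered if c) / m
    half_width = z * math.sqrt(freq * (1.0 - freq) / m)
    return freq, half_width


# Hypothetical outcome: 1000 replications, 900 of them covered.
outcomes = [True] * 900 + [False] * 100
freq, half_width = validity_frequency(outcomes)
```

A predictor would be judged empirically valid at level $1 - \varepsilon = 90\%$ when the target 0.9 lies within `freq` plus or minus `half_width`.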

7.1 Simulated Experiments 

We consider four simulations from the linear regression model

$$y = X'\beta + \sigma\xi, \qquad \xi \sim \mathcal{N}(0, 1), \qquad X = (X_1, \ldots, X_{50})' \in \mathbb{R}^{50},$$

with $\beta \in \mathbb{R}^{50}$. Hence $p = 50$ throughout the simulations. The noise level $\sigma$ and the sample size $n$ are
left free. They will be specified during the experiments.

Example (a) [n/a]: Very Sparse and Correlated. Here only f3\ is nonzero and equals 5. 
Moreover, the design correlations matrix S is described by = exp(— \j — k\) for 
(j, k) G {15, . . . , 35} 2 and Ej^ = I(j = k) otherwise where I(-) is the indicator function. 

Example (b) [n/a]: Sparse and Correlated. The correlations are defined as in Example (a) 
and the regression vector is given by j3j = — 5 + 0.2j for j = 1, . . . , 5; f3j = 4 + 0.2j for 
j = 10, . . . , 25 and zero otherwise. 

Example (c) [n/σ]: Sparse and Highly correlated. We have $\beta_j = 5$ for $j \in \{1, \dots, 15\}$ and
zero otherwise. We construct three groups of correlated variables: $\Sigma_{j,k} = 1$ when $(j, k)$
belongs to $\{1, \dots, 5\}^2$, $\{6, \dots, 10\}^2$ or $\{11, \dots, 15\}^2$; $\Sigma_{j,k} = 1$ for $(j, k) \in \{16, \dots, p\}^2$
if $j = k$, and zero otherwise.

Example (d) [n/σ]: Non Sparse and Correlated. Here $\beta_j = 3 + 0.2j$ for $j \in \{1, \dots, p\}$ and the
correlations are described by $\Sigma_{j,k} = \exp(-|j - k|)$ for $(j, k) \in \{1, \dots, p\}^2$.

3 We construct the CoRP associated with the same tuning parameters as the CoLP (i.e., the transition points
observed in Section 5). Note that the performance would not be affected, as the conformal predictors produced
by this method are almost nested and change gradually as the tuning parameter varies. See [16, page
39] for more details.

4 We use a BIC-type criterion to select the optimal tuning parameter. Such a criterion is well suited to variable
selection.

Figure 2: Analysis of the conformal predictor lengths (y-axis) through the iterations of the LASSO
modification of the LARS algorithm (x-axis: the first iteration corresponds to $\lambda_{\max}$ and the last one
corresponds to $\lambda_{\min}$) in Example (c)[300/1] (top left) and in Example (c)[50/1] (top right). The
iteration associated with the CoLP is marked by a blue star. Predictors which are not valid are
marked by a black circle. The bottom panel shows the lengths of the intervals on a logarithmic scale.
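The designs of Examples (a) to (d) can be written down compactly in code. The following is a minimal sketch, not the authors' implementation: it assumes Gaussian covariates with covariance $\Sigma$ (the text only specifies the correlations, not the law of X), reads $\Sigma_{j,k} = 1$ inside the groups of Example (c) literally (which makes that design degenerate), and uses the hypothetical helper names `design` and `simulate`.

```python
import numpy as np

P = 50  # number of covariates, fixed in all four examples


def design(example, p=P):
    """Return (beta, Sigma) for Example 'a', 'b', 'c' or 'd'.

    The text indexes variables from 1; arrays below are 0-based.
    """
    j = np.arange(1, p + 1)                             # 1-based variable indices
    decay = np.exp(-np.abs(j[:, None] - j[None, :]))    # exp(-|j - k|)
    beta = np.zeros(p)
    sigma_mat = np.eye(p)
    if example in ("a", "b"):
        # Correlated block on variables 15..35, identity elsewhere.
        block = (j >= 15) & (j <= 35)
        mask = block[:, None] & block[None, :]
        sigma_mat[mask] = decay[mask]
        if example == "a":
            beta[0] = 5.0                               # only beta_1 is nonzero
        else:
            first = j <= 5
            second = (j >= 10) & (j <= 25)
            beta[first] = -5 + 0.2 * j[first]
            beta[second] = 4 + 0.2 * j[second]
    elif example == "c":
        beta[:15] = 5.0
        # Three groups of five variables; entries set to 1 as stated in the text.
        for start in (0, 5, 10):
            sigma_mat[start:start + 5, start:start + 5] = 1.0
    elif example == "d":
        beta = 3 + 0.2 * j
        sigma_mat = decay
    else:
        raise ValueError(f"unknown example: {example!r}")
    return beta, sigma_mat


def simulate(example, n, noise_level, rng):
    """Draw n pairs (x_i, y_i) from y = x'beta + sigma * eps (Gaussian X assumed)."""
    beta, sigma_mat = design(example)
    X = rng.multivariate_normal(np.zeros(len(beta)), sigma_mat, size=n)
    y = X @ beta + noise_level * rng.standard_normal(n)
    return X, y


# Example (a)[300/1]: n = 300 observations, noise level sigma = 1.
X, y = simulate("a", n=300, noise_level=1.0, rng=np.random.default_rng(0))
```

In the notation of the text, `simulate("b", 50, 1.0, rng)` corresponds to Example (b)[50/1].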



We consider separately the three points of interest: accuracy, validity and selection. 



Accuracy. First of all, let us consider the lengths of the predictors $\Gamma^{\varepsilon}_k$, $k = 1, \dots, K$, obtained
at the end of Step 2 of Algorithm 1 described in Section 5. Recall that each of these
predictors is associated with an iteration of a modification of the LARS algorithm, that
is, with one of the transition points $\lambda_k$, $k = 1, \dots, K$. Figure 2 illustrates the predictor lengths
for the construction of the CoLP when applied to Example (c)[n/1] with n = 300 and
n = 50. When n = 300, we note that the length of the $\Gamma^{\varepsilon}_k$ changes gradually from one
iteration to the next and that the largest predictor has a reasonable length compared
to the smallest one (about 10 times larger). The construction is therefore stable. We also
observe that in the neighborhood of the optimal iteration (that is, iteration 20), the
conformal predictors have approximately the same size. The same observation can
be made for Example (b)[300/1] in the left panel of the corresponding figure. On
the other hand, when n = 50, the predictor length grows drastically at
some iteration (around iteration 85). The lengths of the largest and smallest predictors
are no longer comparable (the former is more than 10 times larger). At the same time,
the construction becomes unstable, as violent variations often happen after
iteration 85. We will consider the validity of these predictors in the next point. However, let us
mention that in Example (c)[50/1], the CoLP, which is the smallest $\Gamma^{\varepsilon}_k$ and hence the
selected predictor, is not valid (in Figure 2 (right), the selected predictor at iteration 93
is not valid). This can also be observed in Figure 5 (right) (the graph corresponds
to Example (b)[50/1]), where the selected CoLP at iteration 57 is not valid. Similar
violent variations of the corresponding predictor lengths would have been observed
after iteration 49 had we provided a graph analogous to Figure 2 (right).

Table 1: Validity frequencies [with 95% confidence margins] of the CoRP, CoLP, CoRLaP, CENeP,
the Early-Stopped CoLP and the 2-PN CoLP based on 1000 replications.

Example       σ     CoRP            CoLP            CoRLaP          CENeP
(a)[300/σ]    1     0.897 ± 0.019   0.876 ± 0.020   0.854 ± 0.022   0.878 ± 0.020
              7     0.894 ± 0.019   0.908 ± 0.018   0.894 ± 0.019   0.899 ± 0.019
              15    0.893 ± 0.019   0.893 ± 0.019   0.879 ± 0.020   0.887 ± 0.020
(b)[300/σ]    1     0.901 ± 0.018   0.875 ± 0.020   0.869 ± 0.021   0.874 ± 0.021
(c)[300/σ]    1     0.900 ± 0.019   0.900 ± 0.019   0.891 ± 0.019   0.901 ± 0.018
(d)[300/σ]    1     0.892 ± 0.019   0.895 ± 0.019   0.895 ± 0.019   0.895 ± 0.019
(a)[50/σ]     3     0.887 ± 0.020   0.668 ± 0.029   0.414 ± 0.030   0.789 ± 0.025
(a)[20/σ]     3     0.865 ± 0.021   0.596 ± 0.030   0.304 ± 0.028   0.685 ± 0.029

Example       σ     CoRP            CoLP            Stopped-CoLP    2-PN-CoLP
(a)[50/σ]     7     0.853 ± 0.022   0.620 ± 0.030   0.815 ± 0.024   0.881 ± 0.020
(b)[50/σ]     1     0.875 ± 0.020   0.558 ± 0.031   0.814 ± 0.024   0.907 ± 0.018
(c)[20/σ]     15    0.875 ± 0.020   0.608 ± 0.030   0.769 ± 0.026   0.893 ± 0.019
(d)[20/σ]     1     0.900 ± 0.019   0.602 ± 0.030   0.793 ± 0.025   0.892 ± 0.019
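The text does not state how the ± margins reported in Table 1 were obtained. A plausible reading, given here only as an assumption, is the usual normal-approximation margin $1.96\sqrt{\hat{f}(1 - \hat{f})/1000}$ for a frequency $\hat{f}$ estimated from 1000 replications; most reported margins agree with it to three decimals, and a few differ by 0.001. The function names below are hypothetical.

```python
import math


def validity_margin(frequency, n_replications=1000, z=1.96):
    """Normal-approximation 95% margin for an empirical validity frequency.

    Assumption: this formula is not given in the text; it is one standard choice.
    """
    return z * math.sqrt(frequency * (1.0 - frequency) / n_replications)


def compatible_with_nominal(frequency, nominal=0.90, n_replications=1000):
    """True if the nominal level 1 - epsilon lies within the margin around frequency."""
    return abs(frequency - nominal) <= validity_margin(frequency, n_replications)


# CoRP in Example (a)[300/1]: reported as 0.897 ± 0.019.
print(f"0.897 ± {validity_margin(0.897):.3f}")   # prints "0.897 ± 0.019"
# CoLP in Example (a)[50/3]: 0.668 is far below the nominal 90%.
print(compatible_with_nominal(0.668))            # prints "False"
```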

Figure 3: Variable selection analysis for the CoLP, the CoRLaP and the CENeP in Example
(b)[300/1] (variables 1 to 5 and 10 to 25 are relevant; see the variables in dark blue on the plot).
On the left, we consider the variables selected by the CoLP and the CoRLaP (x-axis) with respect to the
iterations of the LASSO modification of the LARS algorithm (y-axis: the first iteration corresponds to
$\lambda_{\max}$ and the last one corresponds to $\lambda_{\min}$). On the right, we consider the variables selected by the
CENeP (x-axis) with respect to the iterations of the Elastic-Net modification of the LARS algorithm
(y-axis: the first iteration corresponds to $\lambda_{\max}$ and the last one corresponds to $\lambda_{\min}$). The selected
iteration is marked by red diamonds for the CoLP, green squares for the CoRLaP and black squares for the CENeP.

Validity. Now, we consider the validity of the selected predictors (cf. Step 3 in Algorithm 1).
As shown in Table 1, variations in the noise level, in the correlations between variables
and in the sparsity of the model do not perturb the validity, whereas the
sample size relative to the dimension p does. When n = 300 > p, all the procedures
behave similarly and produce good predictors. In the other cases, i.e., when
n = p = 50 and n = 20 < p, the selected confidence predictors perform worse
than expected (they are valid in a smaller proportion than $1 - \varepsilon = 90\%$). Moreover, Sparse
Conformal Predictors (SCPs) perform worse than the CoRP, as observed in Table 1. As pointed out
in the accuracy part, one explanation can be seen in Figure 2: the selected predictor,
which is not valid (iteration 93), corresponds to an iteration in the unstable
zone (that is, after iteration 85). In order to reduce the gap between SCPs and the
CoRP in the cases p ≥ n, we suggest modifying the selection criterion of Step 3 in two
ways. i) Early Stopping CoLP: do not consider (and do not construct) all the conformal
predictors $\Gamma^{\varepsilon}_k$. Stop the construction of the predictors as soon as the length of $\Gamma^{\varepsilon}_k$
(the predictor at iteration k) is at least 10 times larger than that of $\Gamma^{\varepsilon}_{k-1}$. ii) N Previous
Neighbors CoLP: we can reinforce the Early Stopping rule by taking as final predictor
$\Gamma^{\varepsilon}_{\mathrm{opt}} = \bigcup_{j :\, 0 \le k - j \le N} \Gamma^{\varepsilon}_j$, where k is the index of the (selected) smallest predictor
and N is the number of neighbors we consider. Note that this method does not alter the
selection properties, as $\Gamma^{\varepsilon}_k$ is usually constructed with more variables than $\Gamma^{\varepsilon}_j$, $j < k$.
Nor does it alter the accuracy much, as the Early Stopping rule ensures that we
are in the stable zone (cf. Figure 2 (right) and Figure 5 (right)). Table 1 sums up the
performance of the early-stopped CoLP and the 2-PN CoLP in terms of validity. We
observe that both methods adapt well to the case p = n, and we remark that the
2-PN CoLP produces valid predictors even in the case p > n. This improvement in
validity is also illustrated by Figure 5 (right), where we observe that in
Example (b)[50/1] the early-stopped CoLP is valid whereas the original CoLP is not.
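The Early Stopping and N Previous Neighbors rules of the Validity paragraph can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: each predictor $\Gamma^{\varepsilon}_k$ is assumed to be a single interval stored as a `(low, high)` pair with positive length, the sequence is assumed non-empty and ordered by LARS iteration, and all function names are hypothetical. The factor 10 is the one quoted in the text.

```python
def length(interval):
    """Length of an interval given as a (low, high) pair."""
    low, high = interval
    return high - low


def early_stop(intervals, factor=10.0):
    """Keep predictors until one is at least `factor` times longer than its predecessor."""
    kept = [intervals[0]]
    for previous, current in zip(intervals, intervals[1:]):
        if length(current) >= factor * length(previous):
            break  # the unstable zone starts here; later predictors are not built
        kept.append(current)
    return kept


def select_smallest(intervals):
    """Index k of the shortest predictor (the CoLP selection of Step 3)."""
    return min(range(len(intervals)), key=lambda k: length(intervals[k]))


def n_previous_neighbors(intervals, k, n_neighbors):
    """Union of the predictors j with 0 <= k - j <= n_neighbors, as merged intervals."""
    members = sorted(intervals[max(0, k - n_neighbors):k + 1])
    merged = [members[0]]
    for low, high in members[1:]:
        last_low, last_high = merged[-1]
        if low <= last_high:                      # overlapping: extend the last piece
            merged[-1] = (last_low, max(last_high, high))
        else:                                     # disjoint: start a new piece
            merged.append((low, high))
    return merged


# Toy sequence of predictors; the last one blows up and triggers early stopping.
path = [(0.0, 10.0), (1.0, 8.0), (2.0, 6.0), (2.5, 5.5), (-50.0, 50.0)]
stable = early_stop(path)
k = select_smallest(stable)
final = n_previous_neighbors(stable, k, n_neighbors=2)   # a 2-PN predictor
```

On this toy path the unstable last predictor is discarded, the shortest remaining one is at index 3, and the 2-PN union of the predictors at indices 1 to 3 is the single interval (1.0, 8.0).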



Selection. We now turn to the selection ability of Sparse Conformal Predictors. First, note
that the variables selected by SCPs are directly linked to the order in which variables enter along
the iterations of the LASSO or Elastic-Net modification of the LARS algorithm. Hence, if
the modification of the LARS algorithm in use fails to recover the true model, we cannot
hope to get a predictor which contains only the true variables. Figure 3 illustrates the
evolution of the variable selection of the CoLP, the CoRLaP and the LASSO on the one hand, and of
the CENeP and the Elastic-Net on the other hand, in Example (b)[300/1]. It turns out
that the CoLP and the CENeP select larger models than expected (that is, some noise variables
are selected), as the LASSO and the Elastic-Net do. Moreover, the CoRLaP tends to select
a smaller subset of variables than the CoLP. It therefore often achieves better variable
selection than the other methods and provides a model closer to the
true one. Compared to the LASSO, it seems that the CoLP and the CoRLaP perform
better in this example. However, we cannot conclude that the CoLP is superior to
the LASSO in terms of variable selection. A similar conclusion holds when we
compare the CENeP and the Elastic-Net. Nevertheless, the CENeP seems to select
slightly larger models than the Elastic-Net. Finally, analogously to the superiority of the
Elastic-Net over the LASSO, we remark that the CENeP achieves
better selection performance than the CoLP and the CoRLaP when a group
structure may exist between different variables (for instance in Example (d)[n/σ]). This
is due to the LASSO modification of the LARS algorithm, which tends to select some noise
variables before relevant ones in such cases.
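The comparison above amounts to checking a selected set of variables against the true support. A minimal sketch of such a check for Example (b), where variables 1 to 5 and 10 to 25 are relevant, is given below; the function name and the returned fields are hypothetical and are not taken from the paper.

```python
def selection_summary(selected, relevant):
    """Compare a selected set of variable indices with the true support."""
    selected, relevant = set(selected), set(relevant)
    return {
        "noise_selected": sorted(selected - relevant),    # selected but irrelevant
        "relevant_missed": sorted(relevant - selected),   # relevant but not selected
        "exact_recovery": selected == relevant,
    }


# True support in Example (b), with variables indexed from 1 as in the text.
relevant_b = set(range(1, 6)) | set(range(10, 26))

# A model slightly larger than the truth: all relevant variables plus noise variable 7.
summary = selection_summary(relevant_b | {7}, relevant_b)
```

Here `summary["noise_selected"]` is `[7]` and no relevant variable is missed, which is the "larger model than expected" situation described for the CoLP and the CENeP.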

7.2 Real data 

We applied SCPs to 150 random permutations of the House Boston dataset⁵, in each of which we
randomly choose one row to be the new pair $(x_{new}, y_{new})$. The original dataset consists of 506
observations with 13 variables. Regarding variable selection, we note that almost all
SCPs are constructed without the variable $X_7 = (x_{1,7}, \dots, x_{505,7})$. This variable is selected
with frequencies lower than 3%. The CoRLaP moreover does not consider the variable $X_3$ as
relevant, with a frequency equal to 17%. In line with Section 7.1, we would rather consider
$X_3$ irrelevant, as the CoRLaP tends to perform better where variable selection is
concerned. We then conclude that the proportion of non-retail business acres per town and
the proportion of owner-occupied units built prior to 1940 do not influence the value of
owner-occupied homes. We also notice that variable selection slightly improved the accuracy
of the conformal predictors in all the presented experiments. For instance, here
the median lengths of the CoLP, the CoRLaP and the CENeP are respectively 13.61, 13.50
and 13.58, whereas the CoRP length is 14.45.
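The resampling protocol described above (permute the rows, then hold out one row as the new pair, leaving 505 observations for the construction) can be sketched as follows. This is only an illustration: the array below is a random placeholder with the shape of the housing data (13 covariates plus the response), not the real dataset, and the function name is hypothetical.

```python
import numpy as np


def leave_one_row_out(data, n_replications, rng):
    """Yield (train, new_row) pairs: permute the rows, then hold one row out at random."""
    n_rows = data.shape[0]
    for _ in range(n_replications):
        permuted = data[rng.permutation(n_rows)]
        new_index = rng.integers(n_rows)
        new_row = permuted[new_index]                     # plays the role of (x_new, y_new)
        train = np.delete(permuted, new_index, axis=0)    # the remaining n - 1 rows
        yield train, new_row


rng = np.random.default_rng(0)
placeholder = rng.standard_normal((506, 14))   # stand-in for the 506 x (13 + 1) data
splits = list(leave_one_row_out(placeholder, n_replications=150, rng=rng))
```

Each element of `splits` holds a training array of 505 rows and one held-out row, matching the index range 1 to 505 used for $X_7$ in the text.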

8 Conclusion 

We presented Sparse Conformal Predictors, a family of $\ell_1$-regularized conformal predictors.
We focused on LASSO and Elastic-Net versions of these Sparse Conformal Predictors. We
illustrated their performance in terms of accuracy, validity and variable selection. We
concluded that such Sparse Conformal Predictors are valid and nicely exploit the sparsity of the
model when the sample size is larger than the number of variables (i.e., when n > p). We
also provided a way to adapt these sparse predictors to the case p ≥ n through a pair of rules
that we called the Early Stopping and N Previous Neighbors rules.

Several extensions of this work can be explored, such as the construction of SCPs with the Adaptive
LASSO [20]; they will be investigated in future work.


Acknowledgement. We would like to thank Professor Arnak Dalalyan and Professor Nicolas 
5 The data and their description are available at http://archive.ics.uci.edu/ml/datasets/Housing